SUMMARY: Vision is the construction of efficient symbolic descriptions from images of the
world. An important aspect of vision is the choice of representations for the different kinds of
information in a visual scene. In the early stages of the analysis of an image, the representations
used depend more on what it is possible to compute from an image than on what is ultimately
desirable, but later representations can be more sensitive to the specific needs of recognition. This
essay surveys recent work in vision at M.I.T. from a perspective in which the representational
problems assume a primary importance. An overall framework is suggested for visual information
processing, in which the analysis proceeds through three representations: (1) the primal sketch,
which makes explicit the intensity changes and local two-dimensional geometry of an image, (2) the
2½-D sketch, which is a viewer-centered representation of the depth, orientation and
discontinuities of the visible surfaces, and (3) the 3-D model representation, which allows an
object-centered description of the three-dimensional structure and organization of a viewed shape.
Recent results concerning processes for constructing and maintaining these representations are
summarized and discussed.

This report describes research done at the Artificial Intelligence Laboratory of the Massachusetts
Institute of Technology. Support for the laboratory's artificial intelligence research is provided in
part by the Advanced Research Projects Agency of the Department of Defense under Office of
Naval Research contract N00014-75-C-0643.
0: Introduction

0.1: Understanding information processing tasks

Vision is an information processing task, and like any other, it needs understanding
at two levels. The first, which I call the computational theory of an information processing task, is
concerned with what is being computed and why; and the second level, that at which particular
algorithms are designed, with how the computation is to be carried out (Marr & Poggio 1977a). For
example, the theory of the Fourier transform is a level 1 theory, and is expressed independently of
ways of obtaining it (algorithms like the Fast Fourier Transform, or the parallel algorithms of
coherent optics) that lie at level 2. Chomsky calls level 1 theories competence theories, and level 2
theories performance theories. The theory of a computation must precede the design of algorithms
for carrying it out, because one cannot seriously contemplate designing an algorithm or a program
until one knows precisely what it is meant to be doing.

I believe this point is worth emphasizing, because it is important to be clear about
the level at which one is pursuing one's studies. For example, there has recently been much interest
in so-called cooperative algorithms (Marr & Poggio 1976) or relaxation labelling (Rosenfeld,
Hummel & Zucker 1976). The attraction of this technique is that it allows one to write plausible
constraints directly into an algorithm, but one must remember that such techniques amount to no
more than a style of programming, and they lie at the second of the two levels. They have
nothing to do with the theory of vision, whose business it is to derive the constraints and
characterize the solutions that are consistent with them.

0.2: Understanding vision

If one accepts in broad terms this statement of what it means to understand an
information processing task, one can go on to ask about the particular theories that one needs to
understand vision. Vision can be thought of as a process that produces from images of the
external world a description that is useful to the viewer and not cluttered by irrelevant
information. These descriptions, in turn, are built or assembled from many different but fixed
representations, each capturing some aspect of the visual scene. In this article, I shall try to present
a summary of our work on vision at M.I.T. seen from a perspective in which the representational
problems assume a primary importance. I shall include summaries of our present ideas as well as
of completed work.

D. Marr                                Representing visual information

The important point about a representation is that it makes certain information
explicit (cf. the principle of explicit naming, Marr 1976). For example, at some point in the analysis
of an image, the intensity changes present there need to be made explicit, so does the geometry -
of the image and of the viewed shape - and so do other parameters like color, motion, position
and binocular disparity. To understand vision thus requires that we first have some idea of which
representations to use, and then we can proceed to analyze the computational problems that arise in
obtaining and manipulating each representation. Clearly the choice of representation is crucial in
any given instance, for an inappropriate choice can lead to unwieldy and inefficient computations.
Fortunately, the human visual system offers a good example of an efficient vision processor, and
therefore provides important clues to the representations that are most appropriate and likely to
yield successful solutions.

This point of view places the nature of the representations at the center of attention,
but it is important to remember that the limitations on the processes that create and use these
representations are an important factor in determining their structure, because one of the
constraints on vision is that the description ultimately produced be derivable from images. In
general, the structure of a representation is determined at the lower levels mostly by what it is
possible to compute, whereas later representations can afford to be influenced by what it is
desirable to compute for the purposes of recognition.


1: Early processing problems


1.0: The primal sketch

There are two important kinds of information contained in an intensity array: the
intensity changes present there, and the local geometry of the image. The primal sketch (Marr
1976) is a primitive representation that allows this information to be made explicit. Following the
clues available from neurophysiology (Hubel & Wiesel 1962), intensity changes are represented by
blobs and by oriented elements that specify a position, a contrast, a spatial extent associated with
the intensity change, a weak characterization of the type of intensity change involved, and a
specification of points at which intensity changes cease (so-called termination points). The
representation of local geometry makes explicit two-dimensional geometrical relations between
significant items in an image. These include parallel relationships between nearby edges, and the
relative positions and orientations of significant places in the image. These significant places are
marked by "place-tokens", and they are defined in a variety of ways, by blobs or local patches of
different intensity, by small lines, and by the ends of lines or bars. The local geometrical relations
between place-tokens are represented by inserting virtual lines that join nearby place-tokens, thus
making explicit the existence of a relation between the two tokens, their relative orientation, and
the distance between them (Marr 1976 figure 12a).
1. The primal sketch makes explicit information held in an intensity array (1a). There are two
kinds; one concerns the changes in intensity, and this is represented by oriented edge, line and bar
elements, associated with which is a measure of the contrast and spatial extent of the intensity
change. The other kind of information is the local two-dimensional geometry of significant places
in the image. Such places are marked by "place-tokens", which can be defined in a variety of
ways, and the geometric relations between them are represented by inserting virtual lines between
nearby tokens. (Marr 1976 figures 7 and 12a).
2. 2a and 2c are random-dot interference patterns of the kind described by Glass (1969). 2b and 2d
exhibit the results of running the algorithm described in the text and figure 3. The neighborhood
radius was such that roughly 8 neighbors were included. (Stevens 1977 figure 5).
1.1: Random-dot interference patterns

The idea of place-tokens and of this way of representing geometrical relations arose
from considering the computational problems that are posed by early visual processing, and one of
the questions we have been asking is, can one find any psychophysical evidence that the human
visual system makes use of a similar representation? We have recently obtained two results related
to this point. Stevens (1977) has examined the perception of random-dot interference patterns
(figure 2), constructed by superimposing two copies of a random dot pattern where one copy has
undergone some composition of expansion, translation, or rotation transformations (Glass 1969). He
found that a simple algorithm suffices to account quantitatively for human performance on these
patterns. The algorithm consists of three steps:

(1) Each dot defines a place-token. For example, some dots can be replaced by small lines or
larger blobs without disrupting the subjective impression of flow.

(2) Virtual lines are inserted between nearby place-tokens, and the neighborhood in which the
virtual lines are inserted depends in a predictable way on the density of the dots.

(3) The orientations of the virtual lines attached to all the points in each neighborhood are
histogrammed, and locally parallel organization is found by searching for a peak in this histogram.
The bucket width that best matches human performance is about 10 degrees.
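
The three steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not Stevens's program: the function names, the 1/(1 + length) weighting used to favor short virtual lines, and the three-point circular smoothing of the histogram are all choices made for the example. Only the use of a neighborhood around each dot and the bucket width of about 10 degrees come from the text.

```python
import math


def orientation(p, q):
    """Orientation of the virtual line through p and q, in degrees in [0, 180)."""
    return math.degrees(math.atan2(q[1] - p[1], q[0] - p[0])) % 180.0


def neighbors(dots, i, radius):
    """Indices of the dots that lie within `radius` of dots[i] (excluding i itself)."""
    return [j for j in range(len(dots))
            if j != i and math.dist(dots[i], dots[j]) <= radius]


def locally_parallel_line(dots, i, radius, bucket_deg=10.0):
    """Return (neighbor index, peak orientation) for the place-token dots[i].

    Returns None when dots[i] has no neighbor inside the neighborhood.
    """
    local = neighbors(dots, i, radius)
    if not local:
        return None

    # Steps 1 and 2: build virtual lines around the dot and around each of its
    # neighbors, and histogram their orientations.
    n_buckets = int(round(180.0 / bucket_deg))
    hist = [0.0] * n_buckets
    for a in [i] + local:
        for b in neighbors(dots, a, radius):
            weight = 1.0 / (1.0 + math.dist(dots[a], dots[b]))  # assumed weighting
            bucket = int(orientation(dots[a], dots[b]) // bucket_deg) % n_buckets
            hist[bucket] += weight

    # Step 3: smooth the (circular) histogram and find its peak.
    smooth = [hist[k - 1] + 2.0 * hist[k] + hist[(k + 1) % n_buckets]
              for k in range(n_buckets)]
    peak = max(range(n_buckets), key=lambda k: smooth[k])
    peak_angle = (peak + 0.5) * bucket_deg

    def angular_gap(j):
        d = abs(orientation(dots[i], dots[j]) - peak_angle) % 180.0
        return min(d, 180.0 - d)

    # Select the virtual line from dots[i] closest in orientation to the peak.
    return min(local, key=angular_gap), peak_angle
```

Note that the sketch is not iterative and uses only information inside the neighborhood, which matches the two properties of the algorithm stressed in the text. A pattern can be built for it by superimposing a list of dots on a translated copy of itself and calling `locally_parallel_line(dots, i, radius)` for each index `i`.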

The details of these steps are set out in figure 3. The interesting features of the
algorithm are: (a) It is not iterative. Stevens could find no evidence that human performance rests
on a cooperative algorithm, although this type of problem is ideal for that approach. (b) The
algorithm is purely local. No global-to-local or top-down interactions are necessary to explain
human performance. (c) What the algorithm finds is locally parallel organization. In this case, the
organization lies in the virtual lines constructed between nearby dots, but locally parallel
organization among the real edges and lines in an image also forms an important part of the
structure of an image (Marr 1976).

1.2: Texture discrimination

The second study is one by Schatz (1977) on texture discrimination. Marr
(1976) suggested that such discriminations could be carried out by first-order discriminations acting
on the description in the primal sketch (p. 501). Marr supposed that certain grouping processes were
needed before the discriminations are made in order to account for the full range of human
texture discrimination, but in a careful examination of the problem, Schatz found that many of
the examples he constructed could be explained by assuming that the discriminations are made
only on real edges or on virtual lines inserted between neighboring place-tokens. If this were
generally true, it would stand in elegant relation to Julesz's (1975) conjecture, that a necessary
condition for the discriminability of two textures is that their dipole statistics differ. This
condition is known not to be sufficient, a state of affairs that one can view as implying that we
have access to only a proper subset of all dipole statistics. It is possible that this proper subset
consists only of real edges and of the virtual lines that join nearby place-tokens.

3. The algorithm for computing locally parallel structure has three fundamental steps. Place tokens
that are defined in the image are the input to the algorithm, which is applied in parallel to each
one. Since, in the case of the Moire dot patterns, each dot contributes a place token, the first step is
to construct a virtual line from that dot to each neighboring dot (within some neighborhood
centered on the dot). A virtual line represents the position, separation, and orientation between a
pair of neighboring dots. To favor relatively nearer neighbors, relatively short virtual lines are
emphasized. The second step is to histogram the orientations of the virtual lines that were
constructed for each of the neighbors. For example, the neighbor D would contribute orientations
AD, DF, DC, and DH to the histogram. The final step (after smoothing the histogram) is to
determine the orientation at which the histogram peaks, and to select that virtual line (AB) closest
to that orientation as the solution. (Stevens 1977 figure 4).

4. The pattern 4a contains two regions, one of whose line segments has the orientation
distribution shown in 4b, and the other has the distribution 4c. Surprisingly, three
orientations cannot be distinguished from a random orientation distribution. If human
texture discrimination is based on first-order discriminations acting on the description held
in the primal sketch, the discriminants that can be brought to bear on this information are

1.3: Discrimination ability

If one accepts that texture discrimination relies upon first-order discriminations of
this type, it is natural to ask how sensitive are the particular discrimination functions that we can
bring to bear on an image. Riley (1977) has found evidence that the available functions are
extremely coarse. For example, figure 4 consists of a background in which the line segments have
a random orientation, surrounding a square containing lines of only three orientations.
Surprisingly, the square cannot be discerned without scrutiny. One interpretation of this and
related findings is that discriminations on orientations other than horizontal and vertical are made
on the output of 5 channels, each nearly binary, and with an angular width of about 35 degrees -
in other words, only very little information is available about the distribution of orientations in an
image. It appears that our discrimination ability is as poor or poorer for the other stimulus
dimensions, for example intensity distribution (Riley 1977).

1.4: Light source effects

In another study concerned with what can be extracted from an image, Ullman
(1976a) enquired about the possible physical basis for the subjective quality of fluorescence, which
is normally associated with the presence of a light source. He noted that at a light source
boundary, the ratio of intensity to intensity gradient changes sharply, whereas this is not true at
reflectance boundaries unless the surface orientation changes sharply. He showed that, in the
mini-world of Mondrians, the discriminant to which this leads predicts human performance
satisfactorily.
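
The observation about the ratio of intensity to intensity gradient can be illustrated with a toy one-dimensional example. The sketch below is not Ullman's discriminant: the linear illumination ramp, the additive model of a light source (a constant emitted term) and the function names are assumptions made purely for illustration, and the profile is assumed to have a non-zero gradient wherever the ratio is measured.

```python
def intensity_to_gradient_ratio(profile, i):
    """Ratio of intensity to intensity gradient at sample i of a 1-D profile."""
    gradient = profile[i + 1] - profile[i]  # assumed to be non-zero
    return profile[i] / gradient


def ratio_jump(profile, edge):
    """Relative change in the ratio across a boundary lying just before index `edge`."""
    left = intensity_to_gradient_ratio(profile, edge - 2)
    right = intensity_to_gradient_ratio(profile, edge)
    return abs(right - left) / abs(left)


# Illumination that rises slowly and linearly across the scan line.
illumination = [100.0 + 2.0 * x for x in range(20)]

# Reflectance boundary: the reflectance steps from 0.2 to 0.8 at x = 10.
reflectance_edge = [(0.2 if x < 10 else 0.8) * e
                    for x, e in enumerate(illumination)]

# Light source boundary (assumed model): from x = 10 onwards the surface
# also emits a constant 200 units of its own.
source_edge = [0.2 * e + (0.0 if x < 10 else 200.0)
               for x, e in enumerate(illumination)]

print(ratio_jump(reflectance_edge, 10))  # small: both terms scale together
print(ratio_jump(source_edge, 10))       # large: intensity rises, gradient does not
```

At the reflectance boundary the intensity and its gradient are multiplied by the same factor, so their ratio barely moves; at the simulated source boundary the intensity jumps while the gradient is unchanged, so the ratio changes sharply.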

K. Forbus (in preparation) has extended this work to the detection of surface luster.
Since glossiness is due to the specular component of a surface reflectivity function, one can treat
the detection of gloss as essentially the detection of light sources that appear reflected in a surface
(see Beck 1974), and this depends ultimately on the ability to detect light sources. Forbus divided
the problem into three categories: (a) in which the specularity is too small to allow gradient
measurements, (b) in which both intensity and gradient measurements are available, but the
specularity is local (as it is for a curved surface or a point source), and (c) in which the surface is
planar and the source is extended. He derived diagnostic criteria for each case.

1.5: Regions from a discriminant

Whenever a region is defined in an image by a predicate, for example by a
difference in texture or brightness, one faces the problem of delimiting the region accurately.
There are two approaches to designing algorithms for this problem; one is to use the predicate
directly, deciding whether a given location lies within or without the region by testing some
function of the predicate there. The second approach is to differentiate the predicate, defining the
region by its boundaries rather than by properties of its interior.

The difficulties with the problem arise because one is usually ignorant beforehand
of the scale at which significant predicate signals may be gathered. For example, suppose one
wished to find the boundary between two regions that are distinguished by different densities of
dots. Dot density has to be measured by selecting a neighborhood size and counting the number of
dots that lie within it. If the neighborhood size is too large, one may not be able to resolve the
regions. If it is so small as to contain zero, one or two dots, natural fluctuations may obscure any
changes in density.

One solution to this problem is to make the measurements simultaneously at several
neighborhood sizes, looking for agreement between the results obtained in those neighborhood
sizes that lie just above the size at which random fluctuations appear. This technique can be
applied to region finding or to boundary finding, and an example of the results is given in figure
5. The dot density here is not known a priori.
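
A minimal sketch of measuring dot density at several neighborhood sizes is given below. It is an illustration of the idea only, under stated assumptions: the function names, the rule that a neighborhood must contain at least three dots before it is trusted (taken from the remark above about zero, one or two dots), and the 25 percent agreement tolerance are all hypothetical choices, not the procedure used to produce figure 5.

```python
import math


def density_estimates(dots, center, radii, min_count=3):
    """Dot density around `center` for each radius that captures enough dots."""
    estimates = {}
    for radius in radii:
        count = sum(1 for d in dots if math.dist(d, center) <= radius)
        if count >= min_count:  # zero, one or two dots: dominated by fluctuations
            estimates[radius] = count / (math.pi * radius ** 2)
    return estimates


def estimates_agree(estimates, tolerance=0.25):
    """True when at least two scales give densities within `tolerance` of their mean."""
    values = list(estimates.values())
    if len(values) < 2:
        return False
    mean = sum(values) / len(values)
    return all(abs(v - mean) <= tolerance * mean for v in values)
```

A density value would then be accepted at a location only when `estimates_agree` holds there, and a region or boundary finder could compare the accepted values at neighboring locations.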

This issue is of considerable technical interest, but it is important not to lose sight of
the underlying computational problem, which is what kind of boundary is to be found, and why?
The techniques of O'Callaghan (1974) for example are designed to find boundaries in dot patterns
so accurately that their positions are determined up to the decision about which dots it passes
through. The justification for this type of study is that humans can assign boundaries this
accurately, but the difficulty lies in formulating a reasonable definition of what the boundary is.

This problem is a deep one, touching the heart of the question of what early vision
is for. I shall return to it later in this essay, but it is perhaps worth remarking here that there
seems to be a clear need for being able to do early visual processing roughly and fast as well as
more slowly and accurately, which means having ways of handling rough descriptions of regions -
ways of characterizing their approximate extent and shape - as well as characterizing their precise
boundaries. Figure 6 contains one example of a region whose rough extent is clear, but whose
exact boundary is not.

The motivation for wanting this is that rough descriptions are very useful during
the early stages of building a shape description for recognition (Marr & Nishihara 1977). For
example a man often appears as a roughly vertical rectangle in an image, and this information is
useful because it eliminates many other shapes from consideration quite early. Campbell (1977) has
suggested that the extraction of rough descriptions from an image may depend on the ability to
examine its lower spatial frequencies. Even if this is one of the available mechanisms it is unlikely
to be the only one, because sparse line drawings can raise the same problems while having almost
no power in their low frequencies. It may be that some notion of rough grouping applied to low
resolution place-tokens set up by pieces of contour in the image provides a useful approach to this
problem.

6. An example of a region whose rough boundary is clear, but whose exact boundary is not.
(Drawing by K. Prendergast, 1977).

1.6: Lightness

Ever since Ernst Mach noticed the bands named after him, there has been
considerable interest in the problem of computing perceived brightness. Of especial interest is the
recent work of Land & McCann (1971) on the retinex theory (see also Horn 1974), which is
concerned with the quantity they call lightness; and that of Colas-Baudelaire (1973) on the
computation of perceived brightness. Lightness is an approximation to reflectance that is obtained
by filtering out slow intensity changes, the underlying idea being that these are usually due to the
illuminant, not to changes in reflectance. The problem with this idea is of course that some slow
changes in intensity are perceptually important (see Horn 1977 for an analysis of shape from
shading). The linear filter model of Colas-Baudelaire performs well on images in which there are
no sharp changes in intensity, but the author found it difficult to extend his model to the more
general case. The recent finding of Gilchrist (1977), that perceived depth influences perceived
brightness, suggests that some aspects of the problem occur quite late - in our terms, at the level of
the 2½-D sketch (see below).
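
The idea of obtaining lightness by filtering out slow intensity changes can be sketched along a single scan line. This is a simplified illustration, not a reproduction of the retinex procedure or of Horn's formulation: working with logarithms, the particular threshold value and the restriction to one dimension are assumptions made for the example.

```python
import math


def lightness_1d(intensity, threshold=0.05):
    """Log lightness along a scan line, relative to the first sample."""
    logs = [math.log(v) for v in intensity]
    steps = [b - a for a, b in zip(logs, logs[1:])]
    # Discard slow changes (attributed to the illuminant); keep sharp ones.
    kept = [s if abs(s) > threshold else 0.0 for s in steps]
    lightness = [0.0]
    for s in kept:
        lightness.append(lightness[-1] + s)
    return lightness
```

The sketch also shows the weakness noted above: any slow change in intensity falls below the threshold and is discarded, including the perceptually important ones that carry shading information.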

Our own work on the brightness problem is probably not relevant to the perception
of brightness, but it is interesting as a demonstration that the primal sketch loses very little
information. Woodham & Marr (unpublished program) have written a program that inverts the
primal sketch, so that its output is an intensity array. The basic idea is to scan outwards from
edges, assigning a constant brightness to points along the scan lines, and arresting the scan when it
encounters another edge. Figure 7 exhibits the results of running this program, showing the
original image (7a), the primal sketch (7b), and the reconstructed intensity array (7c).
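
The scanning idea can be illustrated in one dimension. The program itself is unpublished, so the sketch below is only a guess at the flavor of the computation: the edge format (a position together with the brightness on each side) and the function name are hypothetical, and a real primal sketch holds oriented two-dimensional elements rather than points on a line.

```python
def reconstruct_scanline(length, edges):
    """Rebuild a 1-D intensity array from edge descriptions.

    `edges` is a list of (position, left_brightness, right_brightness), where
    `position` is the index of the first sample to the right of the edge.
    """
    out = [None] * length
    edges = sorted(edges)
    for k, (pos, left, right) in enumerate(edges):
        # Scan rightwards from the edge until the next edge (or the array end).
        stop = edges[k + 1][0] if k + 1 < len(edges) else length
        for x in range(pos, stop):
            out[x] = right
        # Scan leftwards until the previous edge (or the array start),
        # filling only samples that no earlier scan has reached.
        start = edges[k - 1][0] if k > 0 else 0
        for x in range(pos - 1, start - 1, -1):
            if out[x] is None:
                out[x] = left
    return out


print(reconstruct_scanline(8, [(3, 10, 50), (6, 50, 20)]))
# prints [10, 10, 10, 50, 50, 50, 20, 20]
```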

2: Process-oriented theories

2.0: Introduction

I said earlier that, especially at the earlier stages of visual information processing,
the representations and processes are determined more by what it is possible to compute from an
image than by what is desirable. Examples are the problems associated with structure from
motion, stereopsis, texture gradients, and shading.

2.1: Structure from motion

Given a sequence of views of objects in motion, the human visual system is capable
of interpreting the changing views in terms of the shapes of the viewed objects, and their motion
in three-dimensional space. Even if each successive view is unrecognizable, the human observer
easily perceives these views in terms of moving objects (Wallach & O'Connell 1953). To answer the
question of how a succession of images yields an interpretation in terms of three-dimensional
structure in motion, Ullman (1977) divided the problem into two parts: (1) finding a correspondence
between elements in successive views; and (2) determining the three-dimensional structures and
their motion from the way corresponding elements move between views.

An important preliminary question about the correspondence problem concerns the
level at which it takes place. Is it primarily a low-level relation, established between small and
simple parts of the scenes and largely independent of higher-level knowledge and
three-dimensional interpretation? Or do higher level influences, like the interpretation of the
whole of a shape from one frame, play an important part in determining the correspondence?

Ullman has assembled a considerable amount of evidence that the former view is
correct. For example, figure 8 shows two successive frames, one denoted with full lines and the
other with dotted lines. If the whole pattern were being analysed from one frame, the shape of the
wheel extracted, and used to match the elements in the next frame, the observer presented with
these frames in rapid succession should perceive them as a whole wheel rotating. Notice however
that the inner and outer parts of the wheel have their closest neighbors in one direction, whereas
the center parts have theirs in the other; because of this, if the matching were done early and
locally, the observer should see the center part rotating one way, and the inner and outer rings
rotating the other (as shown with arrows in figure 8). When appropriately timed, this is in fact
what happens.

Another line of evidence is the following. The most important factor in finding a
correspondence between elements is the distance the element moves from one view to the next. But
is this distance an objective two-dimensional measurement or an interpreted movement in
three-dimensional space? There is some confusion in the literature about this point, since many
studies have assumed that correspondence strength is linked to the smoothness of apparent motion
(Kolers 1972), and this is apparently more closely related to three- than to two-dimensional
distances. Ullman (1977) has however shown that this assumption is false, and that it is the
two-dimensional distance alone that determines the correspondence.
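
The role of two-dimensional distance can be made concrete with a deliberately crude matcher that pairs each element with its nearest neighbor in the next frame. This is a sketch of the proximity principle only, not Ullman's correspondence scheme: the function name is hypothetical, elements are reduced to points, and nothing prevents two elements from choosing the same partner.

```python
import math


def match_by_proximity(frame1, frame2):
    """For each element of frame1, the index of the nearest element of frame2.

    Distances are measured in the image plane only; no three-dimensional
    interpretation of the scene is involved.
    """
    return [min(range(len(frame2)), key=lambda j: math.dist(p, frame2[j]))
            for p in frame1]
```

Applied to the elemental line segments of a pattern like the wheel in figure 8, a purely local rule of this kind pairs each segment with whichever neighbor is closest, regardless of the shape of the whole figure.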

The second part of the problem is to determine the three-dimensional structure once
the correspondence between successive views has been established. Unless this problem is
constrained in some way, it cannot be solved, so one has to search for reasonable assumptions on
which to base the design of one's algorithms. (This state of affairs is a common one in the theory
of visual processes, as we shall see when we discuss the problems of stereopsis, and shape from
contour.) Ullman suggested basing the interpretation on the following assumptions: (1) any
two-dimensional transformation that has a unique interpretation as a rigid body moving in space
should be interpreted as such an object in motion, and (2) that the imaging process is locally an
orthogonal projection. He then showed that under orthogonal projection, three-dimensional shape
and motion may be recovered from as little as three views each showing the image of the same
five points, no four of which are coplanar. This result leads to algorithms capable of recovering
shape and motion from scenes containing arbitrary objects in motion. The final question is
whether the algorithms that humans employ to recover shape and motion rely on these same two
assumptions, and this question is currently under investigation. The important point here is that
for more human-like algorithms, the number of views can be traded off against the accuracy of
the computation, decreasing the emphasis on the particular number "three".

8. Evidence that the correspondence problem for apparent motion involves matching
operations that act at a low level. Frame 1 is shown with full lines and frame 2 with dotted
lines. Instead of seeing a single wheel rotating, when appropriately timed the wheel splits,
the outer and inner rings rotating one way, and the center rotating the other, as indicated
by the arrows. This suggests that matching is carried out on elemental line segments, and is
governed primarily by proximity. (Adapted from Ullman 1977).

2.2: Stereopsis

Ever since Julesz (1971) made the first random-dot stereogram, it has been clear that
at least to a first approximation stereo vision can be regarded as a modular component of the
human visual system. Marr (1974) and Marr & Poggio (1976) formulated the computational theory
of the stereo matching problem in the following way:

(R1) Uniqueness. Each item from each image may be assigned at most one disparity value. This
condition rests on the premise that the items to be matched correspond to physical marks on a
surface, and so can be in only one place at a time.

(R2) Continuity. Disparity varies smoothly almost everywhere. This condition is a consequence of
the cohesiveness of matter, and it states that only a relatively small fraction of the area of an
image is composed of boundaries.

By representing these constraints geometrically, Marr & Poggio (1976) embodied
them in a cooperative algorithm. In figure 9, Lx and Rx represent the positions of descriptive
elements from the left and right views, and the horizontal and vertical lines indicate the range of
disparity values that can be assigned to left-eye and right-eye elements. The uniqueness condition
then corresponds to the assertion that only one disparity value may be "on" along each horizontal
or vertical line. The continuity condition states that we seek solutions that tend to spread along the
dotted diagonals, which are lines of constant disparity, and between adjacent diagonals. Figure 9b
shows how this geometry appears at each intersection point. Figure 9c gives the corresponding
local geometry when the images are two-dimensional rather than one-dimensional.

Figure 9. Figure 9a shows the explicit structure of the two rules R1 and R2 for the case of a
one-dimensional image, and it also represents the structure of a network for implementing the
algorithm described by equation 1. Solid lines represent "inhibitory" interactions, and dotted lines
represent "excitatory" ones. 9b gives the local structure at each node of the network 9a. This
algorithm may be extended to two-dimensional images, in which case each node in the
corresponding network has the local structure shown in 9c. Such a network was used to solve the
stereograms exhibited in figures 10 and 11. (Marr & Poggio 1976 figure 2).

Figure 10. This and the following figure show the results of applying the algorithm defined by
equation (1) to two random-dot stereograms. The initial state of the network is defined by the input
such that a node takes the value 1 if it occurs at the intersection of a 1 in the left and right eyes
(see figure 9), and it has value 0 otherwise. The network iterates on this initial state, and the
parameters used here, as suggested by the combinatorial analysis, were θ = 3.0, ε = 2.0 and M = 5,
where θ is the threshold and M is the diameter of the "excitatory" neighborhood illustrated in
figure 9c. The stereograms themselves are labelled LEFT and RIGHT, the initial state of the
network as 0, and the state after n iterations is marked as such. To understand how the figures
represent states of the network, imagine looking at it from above. The different disparity layers in
the network lie in parallel planes spread out horizontally, so that the viewer is looking down
through them. In each plane some nodes are on and some are off. Each of the seven layers in
the network has been assigned a different gray level, so that a node that is switched on in the top
layer (corresponding to a disparity of +3 pixels) contributes a dark point to the image and one that
is switched on in the lowest layer (disparity = -3) contributes a lighter point. Initially (iteration 0)
the network is disorganized, but in the final state stable order has been achieved (iteration 14), and
the inverted wedding-cake structure has been found. The density of this stereogram is 50%. (Marr
& Poggio 1976 figure 3).

Figure 11. The algorithm of equation 1, with parameter values given in the legend to figure 10, is
capable of solving random-dot stereograms with densities from 50% down to less than 10%. For this
and smaller densities, the algorithm converges increasingly slowly. If a simple homeostatic mechanism
is allowed to control the threshold θ as a function of the average activity (number of "on" cells) at
each iteration, the algorithm can solve stereograms whose density is very low. In this example, the
density is 5% and the central square has a disparity of +2 relative to the background. The
algorithm "fills in" those areas where no dots are present, but it takes several more iterations to
arrive near the solution than in cases where the density is 50%. When we look at a sparse
stereogram, we perceive the shapes in it as cleaner than those found by the algorithm. This seems
to be due to subjective contours that arise between dots that lie on shape boundaries. (Marr and
Poggio 1976 figure 4).

It can be shown (Marr, Poggio & Palm 1977) that, if a network is created with the
positive and negative connections shown in figure 9c, states of such a network that satisfy the
constraints on the computation are stable, and that given suitable inputs, the network will converge
to these stable states for a wide variety of the control parameters. Thus one can think of the
network as defining an algorithm that operates on many input elements to produce a global
organization via local but highly interactive constraints. Formally, the algorithm reads:


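$$C^{(n+1)}_{xyd} = u\Big( \sum_{x'y'd' \in S(xyd)} C^{(n)}_{x'y'd'} \;-\; \epsilon \sum_{x'y'd' \in O(xyd)} C^{(n)}_{x'y'd'} \;+\; C^{(0)}_{xyd} \Big) \tag{1}$$

where $C^{(n)}_{xyd}$ is the state of the cell at position (x, y) and disparity d at iteration n,
u(z) = 0 if z < θ, and u(z) = 1 otherwise, ε is the inhibition constant, and S and O are the
circular and thick line neighborhoods of the cell in figure 9c. This is an example of a
"cooperative" algorithm (Marr & Poggio 1977a), and it exhibits typical non-linear cooperative
phenomena like hysteresis, filling-in, and disorder-order transitions. Figures 10 and 11 illustrate
two applications of the algorithm to random-dot stereograms.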
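
The update rule of equation (1) can be made concrete with a small sketch. The Python below is only an illustration for the one-dimensional case of figure 9a, not the implementation used by Marr & Poggio. The function names (`threshold`, `initial_state`, `step`), the dictionary that holds the network, the sign convention for disparity (a left element at x pairs with a right element at x + d), and the exclusion of a cell from its own excitatory neighborhood are all assumptions made for the example. The default parameters copy the values quoted in the legend to figure 10, which were chosen for two-dimensional images.

```python
def threshold(z, theta):
    """u(z): 0 below the threshold theta, 1 otherwise."""
    return 1 if z >= theta else 0


def initial_state(left, right, disparities):
    """C^(0): node (x, d) is on when left[x] and right[x + d] are both 1."""
    n = len(left)
    state = {}
    for x in range(n):
        for d in disparities:
            xr = x + d  # assumed convention for the matching right-eye position
            on = 0 <= xr < n and left[x] == 1 and right[xr] == 1
            state[(x, d)] = 1 if on else 0
    return state


def step(state, initial, disparities, theta=3.0, eps=2.0, radius=2):
    """One synchronous iteration of equation (1) for a 1-D image."""
    new_state = {}
    for (x, d) in state:
        # Excitatory support S: same disparity, nearby positions (continuity).
        excite = sum(
            state.get((x + dx, d), 0)
            for dx in range(-radius, radius + 1)
            if dx != 0
        )
        # Inhibitory set O: other disparities along the two lines of sight
        # that pass through (x, d) (uniqueness).
        inhibit = 0
        for d2 in disparities:
            if d2 == d:
                continue
            inhibit += state.get((x, d2), 0)           # same left-eye position
            inhibit += state.get((x + d - d2, d2), 0)  # same right-eye position
        z = excite - eps * inhibit + initial[(x, d)]
        new_state[(x, d)] = threshold(z, theta)
    return new_state
```

For example, on a row of six cells with disparities -1, 0 and +1, a state in which the zero-disparity layer is fully on is left unchanged by `step`, and a single false match added in a neighboring layer is switched off after one iteration. This is a small instance of the stability and uniqueness behaviour described above.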

There are a number of findings that cast doubt on the relevance of this algorithm
to the question of how human stereo vision works. The most important of these findings are (a)
the apparently crucial role played by eye-movements in human stereo vision (see especially
Richards 1977), (b) our ability to tolerate up to 15% expansion of one image (Julesz 1971 figure
2.8.8), (c) our ability to tolerate the severe defocussing of one image (Julesz 1971 figure 3.10.3), (d)
evidence that stereo detectors are organized into "three pools" (convergent, zero disparity, and
divergent) and that this organization is important for stereo vision (Richards 1971); and (e) our
ability to perceive depth in rivalrous stereograms (Mayhew & Frisby 1976). These difficulties led
Marr & Poggio (1977b) to formulate a second stereo algorithm, designed specifically as a model for
human stereopsis.

Our first stereo theory was inspired by Julesz's belief that stereoscopic fusion is a
cooperative process - a belief based primarily on the observation that it exhibits hysteresis. The
main problem with the cooperative algorithm is that it apparently works too well in some ways (it
performs better than humans do when eye-movements are eliminated), and not well enough in
others (humans see depth in rivalrous stereograms). Our ability to fuse two images when one is
blurred, the rivalrous stereogram results of Mayhew & Frisby (1976), and the recent results of Julesz
& Miller (1975) on the existence of independent spatial-frequency-tuned channels in binocular
fusion, suggest that several copies of the image, obtained by successively coarser filtering, are used
during fusion, perhaps helping one another in a way similar to that in which local regions help
each other in our cooperative algorithm.

The second idea was a notion that originated with Marr & Nishihara (1977) and
about which I shall have more to say later, which is that one of the things early visual processing
does is to construct a "depth map" of the surfaces round a viewer. In this map, each direction
away from the viewer is associated with a distance (or some function of distance) and a surface
orientation. We have christened the resulting datastructure the 2½-D sketch.

The important point here is that the 2½-D sketch is in some sense a memory. This
provided the key idea: Suppose that the hysteresis Julesz observed is not due to a cooperative
process at all, but is in fact the result of using a memory buffer in which to store the depth map of
the image as it is discovered. Then, the fusion process itself need not be cooperative, and in fact it
would not even be necessary for the whole image ever to be fused everywhere provided that a
depth map of the viewed surface were built and maintained in this intermediate memory. This
idea leads to the following theory. (1) Each image is convolved with bar-shaped masks of various
sizes, and matching takes place between peak mask values for disparities up to about twice the
panel-width of the mask (see Felton, Richards & Smith 1972), for pairs of masks of the same size
and polarity. (2) Wide masks can control vergence movements, thus causing small masks to come
into correspondence. (3) When a correspondence is achieved, it is held and written down
somewhere (e.g. in the 2½-D sketch). (4) There is a backwards relation between the memory and
the masks, perhaps simply through the control of eye-movements, that allows one to fuse any piece
of a surface easily once its depth map has been established in the memory.
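
Step (1) of this theory can be illustrated with a short sketch. The Python below is a hedged illustration, not the algorithm of Marr & Poggio (1977b): it assumes that the convolution has already been done and that each image is summarized by a list of (position, polarity) peaks from masks of one size. The function name `match_peaks` and the rule of accepting a match only when exactly one candidate lies in range are assumptions introduced for the example. The only element taken from the text is the search range of about twice the panel-width, between peaks of the same polarity.

```python
def match_peaks(left_peaks, right_peaks, panel_width):
    """Pair peaks of equal polarity whose disparity is within 2 * panel_width.

    left_peaks, right_peaks: lists of (position, polarity) for one mask size.
    Returns {left_position: disparity} for left peaks with a single candidate.
    """
    max_disparity = 2 * panel_width
    matches = {}
    for x_left, polarity_left in left_peaks:
        candidates = [
            x_right
            for x_right, polarity_right in right_peaks
            if polarity_right == polarity_left
            and abs(x_right - x_left) <= max_disparity
        ]
        # Assumption: an ambiguous or missing candidate is left unmatched.
        if len(candidates) == 1:
            matches[x_left] = candidates[0] - x_left
    return matches
```

Under these assumptions a wide mask tolerates large disparities but locates peaks coarsely, while a narrow mask finds only small disparities. That is consistent with the suggestion in step (2) that wide masks bring the small masks into correspondence.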

This theory leads to many experimental predictions, which are currently being
tested.


3.0: Introduction

We have discussed the types of information that need to be represented early in the
processing of visual information, and we have examined the computational structure of some of
the processes that can derive and maintain this information. We turn now to the question of what
all this information is to be used for.

3.1: Difficulties with the idea of image segmentation

The current approach to machine vision assumes that the next step in visual
processing consists of a process called segmentation, whose purpose is to divide the image into
regions that are meaningful either in terms of physical objects or for the purpose at hand. Despite
considerable efforts over a long period, the theory and practice of segmentation remain primitive,
and once again I believe that the main reason lies in the failure to formulate precisely the goals of
this stage of the processing. What for example is an object? Is a head one? Is it still one if it is
attached to a body? What about a man on horseback?

These questions point to some of the difficulties one has when trying to formulate
what should be recovered as a region from early visual processing. Furthermore, however one
chooses to answer them, it is usually still impossible to recover the desired regions using only local
grouping techniques acting on a representation like the primal sketch. Most images are too
complex, and even the simplest images cannot often be segmented entirely at that level (e.g. Marr
1976 figure 13).

Something additional is clearly needed, and one approach to the dilemma has been
to invoke specialized knowledge about the nature of the scenes being viewed to aid segmentation
of the image into regions that correspond roughly to the objects expected in the scene.
Tenenbaum & Barrow (1976), for example, applied knowledge about several different types of
scene to the segmentation of images of landscapes, an office, a room, and a compressor. Freuder
(1974) used a similar approach to identify a hammer in a simple scene. If this approach were
correct, it would mean that a central problem for vision is arranging for the right piece of
specialized knowledge to be made available at the appropriate time during segmentation. Freuder's
work, for example, was almost entirely devoted to the design of a heterarchical control system that
made this possible. More recently, the constraint relaxation technique of Rosenfeld, Hummel &
Zucker (1976) has attracted considerable attention for just this reason, that it appears to offer a
technique whereby constraints drawn from disparate sources may be applied to the segmentation
problem whilst incurring only minimal penalties in control. It is however difficult to analyze such
algorithms rigorously even in very clearly defined situations (see e.g. Marr, Poggio & Palm 1977),
and in the naturally more diffuse circumstances that surround the segmentation problem, it may
often be impossible.

3.2: Reformulating the problem

The basic problem seems to be how to formulate precisely the next stage of visual
processing. Given a representation like the primal sketch, and the many possible boundary-defining
processes that are naturally associated with it, which boundaries should one attend to and
why? The segmentation approach fails because objects and desirable regions are not visually
primitive constructions, and hence cannot be recovered reliably from the primal sketch or similar
representation without additional specialized knowledge. If we are to succeed, we must discover
precisely what information it is that needs to be made explicit at this stage, what, if any, additional
knowledge it is appropriate to apply, and we must design a representation that matches these
requirements.

In order to search for clues to a suitable representation, let us return to the physics
of the situation. The primal sketch represents intensity changes and the local two-dimensional
geometry of an image. The principal factors that determine these are (1) the illuminant, (2) surface
reflectance, (3) the shape of the visible surface, and (4) the vantage point. The first two factors
raise the difficult problems of color and brightness, and I shall not discuss them further. The
third and fourth factors are independent of the first two (whether two shapes are the same does
not depend upon their colors or on the lighting), and so may be treated separately.


I shall argue that, since most early visual processes extract information about the
visible surface, it is these surfaces, their shape and disposition relative to the viewer, that need to
be made explicit at this point in the processing. Furthermore, because surfaces exist in three-dimensional
space, this imposes constraints on them that are general, and not confined to particular
objects. It is these constraints that constitute the a priori knowledge that it is appropriate to bring
to bear next.

One example of the exploitation of fairly general constraints was the work of Waltz
(1975), who formulated the constraints that apply to images of polyhedra. The representation on
which that work was based was line drawings, but these are not suitable for our needs here,
because part of the task we wish to carry out is the discovery of physical edges that are only
weakly present or even absent in the primal sketch. The approach of Mackworth (1973) was closer
to what we want, since it involved a primitive way of representing surfaces.

3.3: General classification of shape representations

Part of our task in formulating the problem of intermediate vision is therefore the
examination of ways of representing and reasoning about surfaces. We therefore start our enquiry
by discussing the general nature of shape representations. What kinds are there, and how may one
decide among them? Although it is difficult to formulate a completely general classification of
shape representations, Marr & Nishihara (1977) attempted to set out the basic design choices that
have to be made when a representation is formulated. They concluded that there are three
characteristics of a shape representation that are largely responsible for determining the
information that it makes explicit. The first is the type of coordinate system it uses, whether it is
defined relative to the viewer or to the object being viewed; the second characteristic concerns the
nature of the shape primitives used by the representation, that is, the elements whose positions the
coordinate system is used to define. Are they two- or three-dimensional, in what sizes do they
come, and how detailed are they? And the third is concerned with the organization a
representation imposes on the information in a description, for example is the description modular
or does it have little internal structure? We have two sources of information that can help us to
formulate the important issues in intermediate visual information processing, firstly the
computational problems that arise, and secondly, psychophysics.

3.4: Some observations from psychophysics

Vision provides several sources of information about shape. The most direct are
stereo and motion, but texture gradients in a single image are nearly as effective and the theatrical
techniques of facial make-up rely on the sensitivity of perceived shape to shading. It often
happens that some parts of a scene are open to inspection by some of these techniques, and other
parts by others. Yet different as the techniques are, they have two important characteristics in
common. They rely on information from the image rather than on a priori knowledge about the
shapes of the viewed objects; and the information they specify concerns the depth or surface
orientation at arbitrary points in an image, rather than the depth or orientation associated with
particular objects.

If one views a stereo pair of a complex surface, like a crumpled newspaper or the
"leaves" cube of Ittelson (1960), one can easily state the surface orientation of any piece of the
surface, and whether one piece is nearer to or further from the viewer than its neighbors.
Nevertheless one's memory for the shape of the surface is poor, despite the vividness of its surface
orientation during perception. Furthermore, if the surface contains elements nearly parallel to the
line of sight, their apparent surface orientation when viewed monocularly can differ from the
apparent surface orientation when viewed binocularly.

From these observations, one can perhaps draw some simple inferences.

(a) There is at least one internal representation of the depth, or surface orientation, or both,
associated with each surface point in a scene.

(b) Because surface orientation can be associated with unfamiliar shapes, its representation
probably precedes the decomposition of the scene into objects. (This point is particularly relevant
to our discussion of intermediate visual information processing.)

(c) Because the apparent orientation of a surface element can change, depending on whether it is
viewed binocularly or monocularly, the representation of surface orientation is probably driven
almost entirely by perceptual processes, and is influenced only slightly by specific knowledge of
what the surface orientation actually is. Our ability to "perceive" the surface much better than we
can "memorize" it may also be connected with this point.

In addition, it seems likely that the different sources of information can influence the same
representation of surface orientation.

3.5: The computational problem

In order to make the most efficient use of these different and often complementary
sources of information, they need to be combined in some way. The computational question is,
how best to do this? The natural answer is to seek some representation of the visual scene that
makes explicit just the information these processes can deliver.

Fortunately, the physical interpretation of the representation we seek is clear. All
these processes deliver information about the depth or surface orientation associated with surfaces
in an image, and these are well-defined physical quantities. We therefore seek a way of making
this information explicit, of maintaining it in a consistent state, and perhaps also of incorporating
into the representation any physical constraints that hold for the values that depth and surface
orientation take over the kinds of surface that occur in the real world. Table 1 lists the type of
information that the different early processes can extract from images. The interesting point here
is that although processes like stereo and motion are in principle capable of delivering depth
information directly, they are in practice more likely to deliver information about local changes in
depth, for example by measuring local changes in disparity. Texture gradients and shading
provide more direct information about surface orientation. In addition, occlusion and brightness
and size clues can deliver information about discontinuities in depth. (It is for example amazing
how clear an impression of depth can be obtained from a monocular image containing bright or
dim rectangles of different sizes against a dark background). The main function of the
representation we seek is therefore not only to make explicit information about depth, local surface
orientation, and discontinuities in these quantities, but also to create and maintain a global
representation of depth that is consistent with the local cues that these sources provide. We call
such a representation the 2½-D sketch, and the next section describes a particular candidate for it.

Table 1

The form in which various early visual processes deliver information about the changes in a scene.

r  = depth
δr = small, local changes in depth
Δr = large changes in depth
s  = local surface orientation

Information source      Natural parameter

Stereo                  Disparity, hence especially δr and Δr
Motion                  r, hence δr, Δr
Shading
Texture gradients
Perspective cues
Occlusion

3.6: A possible form for the 2½-D sketch

The example I give for the 2½-D sketch is a viewer-centered representation, which
uses surface primitives of one (small) size. It includes a representation of contours of surface
discontinuity, and it has enough internal computational structure to maintain its descriptions of
depth, surface orientation and surface discontinuity in a consistent state. The representation itself
has no additional internal structure.

Depth may be represented by a scalar quantity r, the distance from the viewer of a
point on a surface. Surface discontinuities may be represented by oriented line elements. Surface
orientation may be represented by a unit vector (x, y, z) in three-dimensional space. Following
those who have used gradient space (Huffman 1971, Horn 1977) we can rewrite this as (p, q, 1),
which can be represented as a vector (p, q) in two-dimensional space. In other words, surface
orientation may be represented by covering an image with needles. The length of each needle
defines the dip of the surface at that point, so that zero length corresponds to a surface that is
perpendicular to the vector from the viewer to that point, and the length increases as the surface
tilts away from the viewer. The orientation of the needle defines the direction of the surface's dip.
Figure 12 illustrates this representation.
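
The passage from a unit vector to a needle can be sketched in a few lines. The Python below is an illustration under stated assumptions, not a definition taken from the text: the z axis is assumed to point from the surface towards the viewer, p and q are obtained by dividing through by z, and the needle length is taken to be the magnitude of (p, q), which is zero for a surface facing the viewer and grows as the surface tilts away. The function name `needle_from_normal` is hypothetical.

```python
import math


def needle_from_normal(x, y, z):
    """Convert a surface normal (x, y, z) into a gradient-space needle.

    Assumes z points towards the viewer, so visible surfaces have z > 0.
    Returns (p, q, length, direction) with direction in radians.
    """
    if z <= 0:
        raise ValueError("surface is edge-on or faces away from the viewer")
    p = x / z                      # rewrite (x, y, z) as (p, q, 1)
    q = y / z
    length = math.hypot(p, q)      # assumed measure of the dip
    direction = math.atan2(q, p)   # direction of the dip in the image plane
    return p, q, length, direction
```

With this convention a normal of (0, 0, 1) gives a needle of zero length, and the length becomes unbounded as the surface approaches edge-on, which is why the sketch rejects z <= 0.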

In principle, the relation between depth and surface orientation is straightforward:
one is simply the integral of the other, taken over regions bounded by surface discontinuities. It is
therefore possible to devise a representation with intrinsic computational facilities that can
maintain the two variables, of depth and surface orientation, in a consistent state. But note that, in
any such scheme, surface discontinuities acquire a special status (as curves across which integration
stops). Furthermore, if the representation is an active one, maintaining consistency through largely
local operations, curves that mark surface discontinuities (e.g. contours that arise from occluding
contours in the image) must be "filled in" completely, so that at no point along an object boundary
can the integration leak across it. It is interesting that subjective contours have this property, and
that they are closely related to subjective changes in brightness (cf section 1.6) that are often
associated with changes in perceived depth. If the human visual processor contains a
representation that resembles the 2½-D sketch, it would therefore be interesting to ask whether
subjective contours occur within it. (See Ullman (1976) for an analysis of the shape of curved
subjective contours).
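
The role of discontinuities as curves across which integration stops can be seen in a one-dimensional sketch. The Python below is a simplified illustration, assuming a single scan line on which the surface slope between neighboring samples is known and each gap between samples is either continuous or marked as a discontinuity. The function name `integrate_orientation` and the list-of-regions output are assumptions made for the example.

```python
def integrate_orientation(slopes, breaks):
    """Integrate slope into relative depth, restarting at each discontinuity.

    slopes[i] is the change in depth between samples i and i + 1.
    breaks[i] is True when a surface discontinuity lies between them.
    Returns one list of relative depths per continuous region.
    """
    regions = [[0.0]]
    for slope, is_break in zip(slopes, breaks):
        if is_break:
            # Integration stops here; the next sample starts a new region
            # whose depth relative to the previous one is not determined.
            regions.append([0.0])
        else:
            regions[-1].append(regions[-1][-1] + slope)
    return regions
```

Each region is recovered only up to an additive constant, so some other cue, for instance disparity across the boundary, is needed to place the regions in depth relative to one another. If a single break were missed, the sum would run straight across the boundary, which is the leak described above.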

In summary, my argument is that the 2½-D sketch is useful because it makes
explicit information about the image in a form that is closely matched to what early visual
processes can deliver. We can formulate the goals of intermediate visual processing as being
primarily the construction of this representation, discovering for example what are the surface
orientations in a scene, which of the contours in the primal sketch correspond to surface
discontinuities and should therefore be represented in the 2½-D sketch, and which contours are
missing in the primal sketch and need to be inserted into the 2½-D sketch in order to bring it into
a state that is consistent with the structure of three-dimensional space. This formulation avoids the
difficulties associated with the terms "region" and "object", and allows one to ask precise questions
about the computational structure of the 2½-D sketch and of processes to create and maintain it.
We are currently much occupied with these problems.


4: Later processing problems


4.0: Introduction

The 2½-D sketch is a poor representation for the purposes of recognition because it
is unstable (in the sense of Marr & Nishihara 1977), it depends on the vantage point, and it fails to
make explicit pieces of a shape (like an arm) that are larger than the primitive size. Except for the
simplest of purposes, it is an inadequate vehicle for a visual system to convey information about
shape to other processes, and so I turn now to representations that are more suitable for recognition
tasks.

If one were to design a shape representation to suit the problems of recognition, one
would naturally base it on an object-centered coordinate system. In addition, one would have to
include shape primitives of many different sizes, so as to be able to make explicit shape
characteristics that can range from a wart to an elephant. Marr & Nishihara (1977) discuss these
questions in detail, and I shall not repeat their observations here. The deepest issues are those
raised by having to define an object-based coordinate system. Since they are central to the
problem of defining representations for use in later processing of visual information, I shall spend
the remainder of the essay discussing this topic.

4.1: Nature of an object-centered coordinate system

Marr & Nishihara (1977) pointed out that there are two types of object-centered
coordinate system that one might attempt to define precisely. One refers all locations on an object
to a single coordinate frame that embraces the entire object, and the other distributes the
coordinate system, making it local to each articulated component or individual shape characteristic.
Marr & Nishihara concluded that the second of these schemes is the more desirable, and they gave
as an example the representation illustrated in figure 13. But with a representation of this kind, the
most difficult questions begin after its internal structure has been defined. How can one define
canonically the coordinate scheme for an arbitrary shape, and even more difficult, how can such a
thing be found from an image before a description of the viewed shape has been computed? Some
kind of answers to these questions must be found if the representation is to be used for
recognition.
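
The distributed scheme can be made concrete with a small data structure. What follows is only a minimal sketch under stated assumptions, not the representation of Marr & Nishihara (1977): each model stores the position of its own axis relative to its parent's frame (translation only, with no orientation), and the names `Model3D`, `offset` and `locate` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Vec = Tuple[float, float, float]


@dataclass
class Model3D:
    """One unit of shape information: a model axis plus its components."""
    name: str
    # Origin of this model's axis, expressed in the PARENT model's frame.
    # Assumption: translation only; a fuller sketch would add orientation.
    offset: Vec = (0.0, 0.0, 0.0)
    parts: Dict[str, "Model3D"] = field(default_factory=dict)

    def add(self, part: "Model3D") -> "Model3D":
        self.parts[part.name] = part
        return part


def locate(root: Model3D, path: List[str]) -> Vec:
    """Position of a nested component in the root frame, obtained by
    chaining the local offsets along the path."""
    x, y, z = root.offset
    node = root
    for name in path:
        node = node.parts[name]
        x, y, z = x + node.offset[0], y + node.offset[1], z + node.offset[2]
    return (x, y, z)


human = Model3D("human")
arm = human.add(Model3D("arm", offset=(0.2, 0.0, 1.4)))
forearm = arm.add(Model3D("forearm", offset=(0.0, 0.0, -0.3)))
hand = forearm.add(Model3D("hand", offset=(0.0, 0.0, -0.25)))

before = locate(human, ["arm", "forearm", "hand"])
arm.offset = (0.2, 0.3, 1.4)  # move the whole arm
after = locate(human, ["arm", "forearm", "hand"])
```

Moving the arm changes where the hand lies in the frame of the whole body, but the hand's own entry, specified relative to the forearm, is untouched. This is the sense in which a local description is the more stable one.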

4.2: Shapes having natural coordinate systems

If the coordinate system used for a given shape is to be canonical, its definition
must take advantage of any salient geometrical characteristics that the shape possesses. For
example, if a shape has natural axes, distinguished by length or by symmetry, then they should be
used. The coordinate system for a sausage should take advantage of its major axis, and for a face,
of its axis of symmetry.
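
One simple way to pick out an axis distinguished by length is to take the direction of greatest spread of the shape's points. The sketch below is an illustration only and is not a method proposed in this essay: it assumes a shape given as a list of 2-D points, uses second moments, and the name `principal_axis` is hypothetical.

```python
import math


def principal_axis(points):
    """Return (centroid, angle) where angle is the direction (radians) of
    greatest spread of a 2-D point set, computed from second moments."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    sxx = sum((x - cx) ** 2 for x, _ in points) / n
    syy = sum((y - cy) ** 2 for _, y in points) / n
    sxy = sum((x - cx) * (y - cy) for x, y in points) / n
    # Orientation of the major axis of the covariance ellipse.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (cx, cy), angle


# A "sausage": a long thin band of points tilted at 30 degrees.
tilt = math.radians(30.0)
sausage = []
for i in range(-10, 11):
    for j in (-1, 0, 1):
        t, w = 0.5 * i, 0.4 * j  # position along and across the band
        sausage.append((t * math.cos(tilt) - w * math.sin(tilt),
                        t * math.sin(tilt) + w * math.cos(tilt)))

centroid, angle = principal_axis(sausage)
```

For this band the recovered direction is the 30 degree tilt. When the spread is equal in all directions, as for a disc or a square, both arguments of `atan2` vanish and the angle carries no information, which is the numerical face of an ambiguous choice of axis.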

Highly symmetrical objects, like a sphere, square, or circular disc, will inevitably lead
to ambiguities in the choice of coordinate systems. For a shape as regular as a sphere this poses no
great problem, because its description in all reasonable systems is the same. One can even allow
other factors, like the direction of motion or of spin, to influence the choice of coordinate frame.
For other shapes, the existence of more than one possible choice probably means that one has to
represent the object in several ways. This is acceptable provided that the number of ways is small.
For example, there are four possible axes on which one might wish to base the coordinate system
for representing a door: the midlines along its length, its width, and its thickness, and, to represent
how the door opens, the axis of its hinges. For a typewriter, there are two choices at the top level:
an axis parallel to its width, because that is usually its largest dimension, and the axis about which
a typewriter is roughly symmetrical.

In general, if an axis can be distinguished in a shape, it can be used as the basis for
a local coordinate system. One approach to the problem of defining object-centered coordinate
systems is therefore to examine the class of shapes having an axis as an integral part of their
structure. One such is the class of generalized cones. (A generalized cone is the surface swept out
by moving a cross section of constant shape but smoothly varying size along an axis, as in figure
14). Binford (1971) drew attention to this class of surfaces, suggesting that it might provide a


13. This diagram illustrates the organization of shape information in a 3-D model
description. Each box corresponds to a 3-D model, with its model axis on the left side of
the box and the arrangement of its component axes shown on the right side. In
addition, some component axes have 3-D models associated with them, and this is indicated
by the way the boxes overlap. The relative arrangement of each model's component axes,
however, is shown improperly, since it should be in an object-centred system rather than the
viewer-centred projection used here. This example shows a coarse overall description of a
human shape along with an elaboration of one of its components (the arm). The important
characteristics of this type of organization are (i) each 3-D model is a self-contained unit
of shape information and has a limited complexity, (ii) information appears in shape
contexts appropriate for recognition (the disposition of a finger is most stable when
specified relative to the hand that contains it), and (iii) the representation can be used
flexibly (components can be elaborated according to the needs of the moment or the time
available, and a 3-D model description of a component is easily added to a description of
the whole shape). The major limitation imposed on the representation by this form of
organization is on its scope, since it will only be useful for shapes for which the
decomposition into 3-D models is well defined. (Marr & Nishihara 1977 figure 3).


14. The definition of a generalized cone. In this article, a generalized cone is the surface
generated by moving a smooth cross-section ρ along a straight axis Λ. The cross-section
may vary smoothly in size (as prescribed by the function h(z)), but its shape remains
constant. The eccentricity of the cone is the angle ψ between its axis and a plane
containing a cross-section. (Figure 5 of Marr 1977).
sticks shown here are natural or canonical axes of the shapes described. To be useful for
recognition, a shape representation must be based on characteristics that are uniquely defined by
the shape and which can be derived reliably from images of it (Marr & Nishihara 1977 figure 1).
convenient way of describing three-dimensional surfaces for the purposes of computer vision. I
regard it as an important class not because the shapes themselves are easily describable, but because
the presence of an axis allows one to define a canonical local coordinate system. Fortunately many
objects, especially those whose shape was achieved by growth, are described quite naturally in terms
of one or more generalized cones. The animal shapes in figure 15 provide some examples: the
individual sticks are simply axes of generalized cones that approximate the shapes of parts of
these animals. Many artifacts can also be described in this way, like a car (a small box sitting atop
and in the middle of a longer one), and a building (a box with a vertical axis).
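
The definition of a generalized cone translates directly into a procedure for generating points on its surface. The sketch below is illustrative only. It assumes that the cross-section is given in polar form as a radius function `rho(theta)`, that the scaling function is `h(z)`, that the axis is the z axis, and that the cross-section planes are perpendicular to the axis; the function names are hypothetical.

```python
import math


def generalized_cone(rho, h, z_values, n_theta=36):
    """Sample a generalized cone as a list of rings of (x, y, z) points.

    rho(theta): radius of the fixed cross-section shape in direction theta.
    h(z):       axial scaling function applied to the cross-section at height z.
    Assumption: the axis is the z axis and cross-sections are perpendicular to it.
    """
    rings = []
    for z in z_values:
        ring = []
        for k in range(n_theta):
            theta = 2.0 * math.pi * k / n_theta
            r = h(z) * rho(theta)  # same shape, rescaled at this height
            ring.append((r * math.cos(theta), r * math.sin(theta), z))
        rings.append(ring)
    return rings


def ellipse(theta, a=2.0, b=1.0):
    """Polar radius of an ellipse with semi-axes a and b."""
    return a * b / math.hypot(b * math.cos(theta), a * math.sin(theta))


# A vase: circular cross-section with a gently undulating scaling function.
vase = generalized_cone(lambda th: 1.0,
                        lambda z: 1.0 + 0.3 * math.sin(z),
                        [0.1 * i for i in range(60)])

# A tapering limb: elliptical cross-section that shrinks linearly along the axis.
limb = generalized_cone(ellipse, lambda z: 1.0 - 0.1 * z,
                        [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
```

Every ring is a scaled copy of the same cross-section, so the axis is recoverable from the surface itself; that is the property on which the argument above depends.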

It  is  important  to  remember  that  there  exist  surfaces  that  cannot  conveniently  be 
approximated  by  generalized  cones,  for  example  a cake  that  has  been  cut  at  its  intersection  with 
some  arbitrary  plane,  or  the  surface  formed  by  a crumpled  newspaper.  Cases  like  the  cake  can  be 
dealt  with  by  introducing  suitable  surface  primitives  that  describe  the  plane  of  the  cut,  but  the 
crumpled  newspaper  poses  apparently  intractable  problems. 

4.3: Finding the natural coordinate system from an image

Even if a shape possesses a canonical coordinate system, one is still faced with the
problem of finding it from an image. Blum (1973), Agin (1972) and Nevatia (1974) have addressed
problems that are related to this question. Blum's sym-axis theory is an interesting one, because he
specifies precisely what it is that is computed from a two-dimensional outline. Unfortunately, it is
not clear that what this theory computes is in fact useful for shape recognition (see e.g. figure 16),
and when applied to a three-dimensional shape, the sym-axis is in general a two-dimensional sheet,
so it cannot easily be used to define an object-centered coordinate system. Agin's and Nevatia's
work, on the other hand, concerns the analysis of a depth map. This is an important problem, and
it would be interesting to see a careful analysis of the conditions under which their techniques will
succeed.

My own interest in the problem grew from the 3-D representation theory of Marr &
Nishihara (1977), in particular from the question of how to interpret the outlines of objects as seen
in a two-dimensional image. The rest of this essay summarizes a recent article by Marr (1977). The
starting point for this work was the observation that when one looks at the silhouettes in Picasso's
work "Rites of Spring" (figure 17), one perceives them in terms of very particular three-dimensional
shapes, some familiar, some less so. This is quite remarkable, because the silhouettes could in
theory have been generated by an infinite variety of shapes which, from other viewpoints, have no
discernible similarities to the shapes we perceive. One can perhaps attribute part of the
phenomenon to a familiarity with the depicted shapes; but not all of it, because one can use the
medium of a silhouette to convey a new shape, and because even with considerable effort it is
difficult to imagine the more bizarre three-dimensional surfaces that could have given rise to the
same silhouettes. The paradox is that the bounding contours in figure 17 apparently tell us more


16. Blum's (1973) grassfire technique for recovering an axis from a silhouette is undesirably
sensitive to small perturbations in the contour. 16a shows the Blum transform of a
rectangle, and 16b, of a rectangle with a notch. (Redrawn from Agin 1972).
17. "Rites of Spring" by P. Picasso. We immediately interpret the silhouettes in terms of particular
3-D surfaces, despite the paucity of information in the image. In order to do this, we must be
bringing additional assumptions and constraints to bear on the analysis of these contours' shapes.
Marr (1977) enquired about the nature of this a priori information.
18. From viewpoint V, the three-dimensional surface Σ forms the silhouette S_V in the image via the
imaging process π. The boundary of S_V, obtained by the boundary operator ∂, is denoted by C_V
and we call it the contour of Σ. The set of points on Σ that π maps onto C_V we call the contour
generator of C_V, and it is denoted by Γ_V. The map from Σ to Γ_V induced by ∂ is denoted by i.
(Figure 2 of Marr 1977).
than they should about the shape of the dark figures. For example, neighboring points on such a
contour could in general arise from widely separated points on the original surface, but our
perceptual interpretation usually ignores this possibility.

The first observation to be made here is that the occluding contours that bound
these silhouettes are contours of surface discontinuity, that is, precisely the contours with which the
2½-D sketch is concerned. Secondly, since we can interpret them as three-dimensional shapes,
there must lie implicit in the way we interpret them some a priori assumptions that allow us to infer
a shape from an outline. If a surface violates these assumptions, our analysis will be wrong, in the
sense that the shape we assign to the contours will differ from the shape that actually caused them.
An everyday example of this phenomenon is the shadowgraph, where the appropriate arrangement
of one's hands can, to the surprise and delight of a child, produce the shadow of an apparently
quite different shape, like a duck or a rabbit.

What assumptions is it reasonable to suppose that we make? In order to explain
them, I need to define the four structures that appear in figure 18. These are (1) some
three-dimensional surface Σ; (2) its image or silhouette S_V as seen from a viewpoint V; (3) the
bounding contour C_V of S_V; and (4) the set of points on the surface Σ that project onto the
contour C_V. We shall call this last set the contour generator of C_V, and we shall denote it by Γ_V.

If one is presented with a contour in an image, without any knowledge of the
surface or perspective that caused it, there is very little information on which one can base one's
analysis. The only obvious feature available is the distinction between convex and concave pieces
of contour, that is, the presence of inflection points. In order that inflection points be "reliable",
one needs to make some assumptions about the way the contour was generated, and I chose the
following restrictions:

R1: The surface Σ is smooth.

R2: Each point on the contour generator Γ_V projects to a different point on the contour C_V.

R3: Nearby points on the contour C_V arise from nearby points on the contour generator Γ_V.

R4: The contour generator Γ_V of C_V is planar.

The first restriction is only a technical one. The second and third say that each
point on the contour in the image comes from one point on the surface (which is an assumption
that facilitates the analysis but is not of fundamental importance), and that where the surface looks
continuous in the image, it really is continuous in three dimensions. The fourth condition, together
with the constraint that the imaging process be an orthogonal projection, is simply a necessary and
sufficient condition that the difference between convex and concave contour segments reflects
properties of the surface, rather than characteristics of the imaging process.
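
The distinction between convex and concave pieces of a contour is easy to compute once the contour is available as an ordered list of points. The following is a minimal sketch and not the procedure of Marr (1977): it assumes a closed contour sampled counterclockwise, labels each sample by the sign of the local turn, and counts inflections as changes of label. The names `turn_signs` and `count_inflections` are hypothetical.

```python
import math


def turn_signs(contour):
    """Label each point of a closed, counterclockwise contour as convex (+1)
    or concave (-1) from the sign of the turn made at that point."""
    n = len(contour)
    signs = []
    for k in range(n):
        x0, y0 = contour[k - 1]
        x1, y1 = contour[k]
        x2, y2 = contour[(k + 1) % n]
        # z component of the cross product of the incoming and outgoing edges
        cross = (x1 - x0) * (y2 - y1) - (y1 - y0) * (x2 - x1)
        signs.append(1 if cross >= 0 else -1)
    return signs


def count_inflections(signs):
    """Number of convex/concave transitions around the closed contour."""
    n = len(signs)
    return sum(1 for k in range(n) if signs[k] != signs[k - 1])


# A smooth peanut-shaped outline: r(theta) = 1 + 0.5 cos(2 theta).
n = 360
peanut = []
for k in range(n):
    theta = 2.0 * math.pi * k / n
    r = 1.0 + 0.5 * math.cos(2.0 * theta)
    peanut.append((r * math.cos(theta), r * math.sin(theta)))

labels = turn_signs(peanut)
inflections = count_inflections(labels)
```

The outline has two convex lobes separated by two concave waists, so the labels change four times around the contour. Under restrictions R1 to R4 and orthogonal projection, such changes reflect the surface rather than the viewpoint.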

It turns out that the following theorem is true, and it is a result that I found very

Theorem. If R1 is true, and R2 - R4 hold for all distant viewing directions
that lie in some plane, then the viewed surface is a generalized cone.

This means that if, for distant viewpoints whose viewing directions lie
parallel to some plane, a surface's shape can successfully be inferred using only the
convexities and concavities of its bounding contours in an image, then that surface is a
generalized cone or is composed of several such cones. The interesting thing about this
result is that it implicates generalized cones. We have already seen that the important thing
about these cones is that an axis forms an integral part of their structure. But this is a
feature of their three-dimensional organization, and ought in some sense to be independent
of the issues raised by vision. What the theorem says is that there is a natural link between
generalized cones and the imaging process itself. The combination of these two must mean,
I think, that generalized cones will play an intimate role in the development of vision
theory.

4.4: Interpreting the image of a single generalized cone

If we take this result at face value, we can now ask an obvious question. Let
us assume that our data consist of contours of surface discontinuity in the image of a
generalized cone, since without this assumption we can deduce nothing. How may such
contours be interpreted? To specify a generalized cone, we have to specify its axis Λ,
cross-section ρ(θ), and axial scaling function h(z) (figure 14); how can we discover them from an
image?

The answer to this question is based on the notion of the skeleton of a
generalized cone. The skeleton is not a difficult idea, since it is very like the set of lines a
cartoonist draws to convey the shape of a curved object. It consists of three classes of
contour: (a) the contours that occur in a generalized cone's silhouette; (b) the contours that
arise from maxima and minima in a cone's axial scaling function (called the cone's radial
extremities); and (c) contours that arise from maxima and minima in the cone's
cross-section (its fluting). These categories are illustrated in figure 19.

The reason why the skeleton is a useful construct for recognition is that one
can detect its presence in an image by the many relationships that exist among its parts.
For example, radial extremities are all parallel to each other, and the silhouette and fluting
have a kind of symmetry about the image of the cone's axis. It turns out that one can use
these relationships to set up constraints on a set of contours such that, if those constraints


22. This figure illustrates the types of side-to-end join that can occur between two short
generalized cones. In the first column, the left-hand cone is convex; in the center column it
is concave; and in the third column, it is convex on one side of the join and concave on the
other. The other cone is convex in the top row, and concave in the other two.
Segmentation depends upon finding the points P and Q, which are defined by theorem 7
of Marr (1977) and illustrated here for each case. (Figure 16 of Marr 1977).
are all satisfied by a unique interpretation of the contours in the image, one can be
reasonably certain that a skeleton has been found, and hence that the contours can be
interpreted as arising from a generalized cone whose axis is then determined. The practical
importance of this result is illustrated in figure 20, where one can see that the image of the
"sides" is symmetrical about the bucket's axis, and there is a clear parallel relationship
between the image of the bucket's top, the corrugations in its side, and the visible part of its
base (the bucket's radial extremities). These relations, of symmetries and parallelism, are
preserved by an orthogonal projection. Hence provided that the contours are formed along
a viewing direction that is not too close to the axis of the cone, these relations will still be
present in the image. If the viewing direction lies so close to the cone's axis that its image
is substantially foreshortened, these relationships will no longer be present, but it is part of
the overall theory that such views have to be handled differently (Marr & Nishihara 1977).
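
The claim that parallelism among radial extremities survives an orthogonal projection can be checked numerically. The sketch below is illustrative only and rests on several assumptions: the cone is a surface of revolution about the z axis, so its radial extremities are circles; the viewing direction makes an angle `view_angle` with the axis; the image may be rotated by `roll`. The function names are hypothetical.

```python
import math


def project(point, view_angle, roll=0.0):
    """Orthogonal projection of a 3-D point onto the image plane.
    The viewing direction is (sin(view_angle), 0, cos(view_angle))."""
    x, y, z = point
    a = y
    b = -x * math.cos(view_angle) + z * math.sin(view_angle)
    c, s = math.cos(roll), math.sin(roll)
    return (c * a - s * b, s * a + c * b)


def ellipse_stats(pts):
    """Orientation (radians) and minor/major axis ratio of a 2-D point set,
    both computed from its second moments."""
    n = len(pts)
    cx = sum(p[0] for p in pts) / n
    cy = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - cx) ** 2 for p in pts) / n
    syy = sum((p[1] - cy) ** 2 for p in pts) / n
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in pts) / n
    orientation = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    mean = 0.5 * (sxx + syy)
    spread = math.hypot(0.5 * (sxx - syy), sxy)
    ratio = math.sqrt((mean - spread) / (mean + spread))
    return orientation, ratio


def ring(z, radius, n=72):
    """A circular radial extremity at height z on the cone's axis."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n), z) for k in range(n)]


view_angle = math.radians(60.0)  # viewing direction 60 degrees off the axis
roll = 0.4                       # arbitrary rotation of the image
# (height, radius) of three radial extremities, e.g. base, corrugation, top
extremities = [(0.0, 1.0), (1.0, 1.4), (2.0, 0.8)]
stats = [ellipse_stats([project(p, view_angle, roll) for p in ring(z, r)])
         for z, r in extremities]
```

All three projected rings come out with the same orientation and the same axis ratio, the cosine of the angle between the viewing direction and the cone's axis. As that angle approaches zero the ratio approaches one and the orientation becomes ill-defined, which corresponds to the foreshortened views that the theory sets aside.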

4.5: Surfaces composed of two or more generalized cones

Real-life objects are often approximately composed of several different
cones, joined together in various ways (see figure 19), and we therefore have to study ways
of decomposing a multiple cone into its components, for example, a human body into
arms, legs, torso and head. Marr (1977) analyzed the two types of join shown in figure 21,
giving criteria that define segmentation points on the contour produced by two joined cones
(theorems 7 and 8). Figure 22 exhibits the segmentation points P and Q for the case in
which two short cones are joined side-to-end. P. Vatan has written a computer program
that can carry out this segmentation, and an example of its operation is illustrated in figure
23. The legend to the figure describes the particular algorithm used.

4.6: Some comments on the limitations of this theory

The results of this theory are limited in their scope to a particular class of
views and surfaces, but on the other hand, they use only a limited kind of visual
information, little more than occluding contours that are formed in an image by rays that
graze a smooth surface. Interestingly, these particular contours are unsuitable for use in
stereopsis or structure-from-motion computations, because they are not formed from
markings that define precise locations on the viewed surface. Creases and folds on a
surface also give rise to contours in an image, and these have yet to be studied in detail.
Information about shape from shading, texture, stereo or motion information has not yet
been considered. By adding these other sources of information, I hope that a set of
methods can eventually be assembled that together approach a comprehensive treatment of
possible image configurations.
23. The occluding contours in an image can be used to locate the images of the natural axes
of a shape composed of generalized cones (Marr 1977). The initial outline in (a) was
obtained by applying local grouping processes to the primal sketch of the image of a toy
donkey (Marr 1976). This outline was then smoothed and divided into convex and concave
sections to get (b). Next, strong segmentation points, like the deep concavity circled in (c),
are identified and a set of heuristic rules are used to connect them with other points on the
contour to get the segmentation shown in (d). The component axes shown in (e) are then
derived from these. The resulting segments are checked to see that they obey the rules for
images of generalized cones. The boundaries must for example be symmetric about the
axes, and in the case of side-to-end joins, the axis of the cone that is attached by its end
must intersect the segmentation points that separate the two cones' contours. In this
example, most of the symmetry relations have degenerated into parallelism. The thin lines
in (f) indicate the position of the head, leg, and tail components along the torso axis, and
the snout and ear components along the head axis. (This algorithm is due to P. Vatan).


TABLE 2

A framework for the derivation of shape information from images.

IMAGE(S)

PRIMAL SKETCH(ES)
    Describes the intensity changes present in an image, labels distinguished
    locations like termination points, and makes explicit local two-dimensional
    geometrical relations.

2 1/2-D SKETCH
    Represents contours of surface discontinuity, and depth and orientation of
    visible surface elements, in a coordinate frame that is centered on the viewer.

3-D MODEL REPRESENTATION
    Shape descriptions that include volumetric shape primitives of a variety of
    sizes, whose positions are defined using an object-centered coordinate system.
    This representation imposes considerable modular organization on its
    descriptions.

6: Discussion

I have tried to make two main points in this article. The first is
methodological, namely that it is important to be very clear about the nature of the
understanding we seek. The results we try to achieve should be precise ones, at the level of
what I called computational theory, and one should try to choose problems that can
confidently be attributed to a real aspect of vision, and not (for example) an artifact of the
limitations of one's current vision program.

The second main point is that the critical issues for vision seem to me to
revolve around the nature of the representations used, and the nature of the processes that
create, maintain and read them. I have suggested an overall framework for visual
information processing that consists of three principal representations, the primal sketch,
the 2½-D sketch and the 3-D model representation, and it is summarized in table 2. As we
study the processes capable of creating, maintaining and reading these representations, it is
essential to make explicit their inherent limitations together with the assumptions that are
implicit in their design. In addition, one should try hard to test experimentally any
conclusions to which these studies lead, because it is foolish to ignore the clues and tests
available from the disciplines of experimental psychology, neurophysiology and clinical
neurology, even though it is often difficult to use this information fruitfully.